[SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions #27433
zero323 wants to merge 6 commits into apache:master from zero323:SPARK-30682
Conversation
Test build #117731 has finished for PR 27433 at commit

Test build #117743 has finished for PR 27433 at commit

Test build #117747 has finished for PR 27433 at commit

Nice! cc @felixcheung and @shivaram.

cc @falaki too fyi

Test build #117796 has finished for PR 27433 at commit

Test build #117801 has finished for PR 27433 at commit
```r
parameters <- formals(fun)
nparameters <- length(parameters)

stopifnot(
```
@zero323, can we remove this one too here for now? Let's discuss and figure out a better way in the next PR about this.
Maybe to a lesser extent (the variation in argument types is smaller, but so is the amount of logic required), but overall the same situation as here ‒ #27406 (comment).
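For context, here is a minimal sketch of the kind of arity validation being discussed above. The helper name `validate_lambda_arity` and the allowed arities are illustrative assumptions for this sketch, not the actual SparkR internals:

```r
# Illustrative only: reject R functions whose number of formal arguments
# does not match what the target higher order function expects.
validate_lambda_arity <- function(fun, expected_arities = c(1, 2, 3)) {
  parameters <- formals(fun)
  nparameters <- length(parameters)
  stopifnot(
    is.function(fun),                    # fun must be an R function
    nparameters %in% expected_arities    # and take a supported number of args
  )
  invisible(parameters)
}

validate_lambda_arity(function(x) x + 1)    # passes
# validate_lambda_arity(function() 42)      # would fail the stopifnot check
```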
Test build #117987 has finished for PR 27433 at commit
Generally it looks good. Thanks for working on this.
Please see functions.R L247 for an extra space that shouldn't be there.
Test build #118074 has finished for PR 27433 at commit

Test build #118078 has finished for PR 27433 at commit
retest this please

Okay, let's merge this in. I will take a separate look for a followup if it's needed.

Test build #118891 has finished for PR 27433 at commit

Merged to master.

Thanks a bunch for your support @HyukjinKwon @felixcheung!
[SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions

### What changes were proposed in this pull request?

This PR adds an R API for invoking the following higher order functions:

- `transform` -> `array_transform` (to avoid conflict with `base::transform`)
- `exists` -> `array_exists` (to avoid conflict with `base::exists`)
- `forall` -> `array_forall` (no conflict, renamed for consistency)
- `filter` -> `array_filter` (to avoid conflict with `stats::filter`)
- `aggregate` -> `array_aggregate` (to avoid conflict with `stats::aggregate`)
- `zip_with` -> `arrays_zip_with` (no conflict, renamed for consistency)
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

The overall implementation follows the same pattern as proposed for PySpark (apache#27406) and reuses objects supporting the Scala implementation (SPARK-27297).

### Why are the changes needed?

Currently higher order functions are available only through the SQL and Scala APIs, and from R they can be used only as SQL expressions:

```r
select(df, expr("transform(xs, x -> x + 1)"))
```

This is error-prone and hard to get right when complex logic is involved (`when` / `otherwise`, complex objects). With this PR, the above can be rewritten simply as:

```r
select(df, transform("xs", function(x) x + 1))
```

### Does this PR introduce any user-facing change?

No (but new user-facing functions are added).

### How was this patch tested?

Added new unit tests.

Closes apache#27433 from zero323/SPARK-30682.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…e in higher order functions

### What changes were proposed in this pull request?

This PR is a followup of #27433. It fixes the naming to match the Scala side, similar to #31062. Note that:

- there is a bit of inconsistency already, e.g. `x`, `y` in SparkR, which are documented together for doc deduplication. This part is not changed here, but the `zero` vs `initialValue` mismatch looks unnecessary.
- such name matching already seems pretty common in SparkR.

### Why are the changes needed?

To make the usage similar to the Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins builds will test it out. Also, manually tested:

```r
> df <- select(createDataFrame(data.frame(id = 1)), expr("CAST(array(1.0, 2.0, -3.0, -4.0) AS array<double>) xs"))
> collect(select(df, array_aggregate("xs", initialValue = lit(0.0), merge = function(x, y) otherwise(when(x > y, x), y))))
  aggregate(xs, 0.0, lambdafunction(CASE WHEN (x > y) THEN x ELSE y END, x, y), lambdafunction(id, id))
1                                                                                                     2
```

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?

This PR adds an R API for invoking the following higher order functions (see the usage sketch after this list):

- `transform` -> `array_transform` (to avoid conflict with `base::transform`)
- `exists` -> `array_exists` (to avoid conflict with `base::exists`)
- `forall` -> `array_forall` (no conflict, renamed for consistency)
- `filter` -> `array_filter` (to avoid conflict with `stats::filter`)
- `aggregate` -> `array_aggregate` (to avoid conflict with `stats::aggregate`)
- `zip_with` -> `arrays_zip_with` (no conflict, renamed for consistency)
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

The overall implementation follows the same pattern as proposed for PySpark (#27406) and reuses objects supporting the Scala implementation (SPARK-27297).
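For illustration only, here is how a few of the new wrappers could be used together; the one-row DataFrame and the column name `xs` are made up for this sketch:

```r
library(SparkR)
sparkR.session()

# A one-row DataFrame with an array column to experiment on.
df <- select(
  createDataFrame(data.frame(id = 1)),
  expr("CAST(array(1.0, 2.0, 3.0) AS array<double>) xs")
)

# Apply an R lambda to every element of the array column.
head(select(df, array_transform("xs", function(x) x + 1)))

# Keep only the elements that satisfy a predicate.
head(select(df, array_filter("xs", function(x) x > 1)))

# Check whether any element satisfies a predicate.
head(select(df, array_exists("xs", function(x) x > 2)))
```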
Why are the changes needed?

Currently higher order functions are available only through the SQL and Scala APIs, and from R they can be used only as SQL expressions:
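```r
select(df, expr("transform(xs, x -> x + 1)"))
```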
This is error-prone and hard to get right when complex logic is involved (`when` / `otherwise`, complex objects). With this PR, the above can be rewritten simply as:
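```r
select(df, transform("xs", function(x) x + 1))
```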
Does this PR introduce any user-facing change?
No (but new user-facing functions are added).
How was this patch tested?
Added new unit tests.
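For a sense of what those tests look like, here is a minimal sketch in the testthat style used by SparkR's test suite; the exact test names, fixtures, and assertions in the PR may differ:

```r
library(testthat)
library(SparkR)

sparkR.session()

test_that("array_transform applies an R function to each array element", {
  df <- select(
    createDataFrame(data.frame(id = 1)),
    expr("CAST(array(1.0, 2.0, 3.0) AS array<double>) xs")
  )
  # Collecting an array<double> column yields a list column of numeric vectors.
  result <- collect(select(df, array_transform("xs", function(x) x + 1)))
  expect_equal(result[[1]][[1]], c(2.0, 3.0, 4.0))
})
```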